Add timeout to gRPC requests #72
Conversation
Force-pushed from 06b7577 to 5cb47e9

This timeout is now increased dynamically with each failed connection attempt.
tiyash-basu-frequenz left a comment:
It looks like `grpc_call_with_timeout` is invoked per call. Maybe I am wrong, but wasn't the idea to implement the backoff at the client level, and not at a call level?
If we indeed need to implement it at the client level, then we could send the commands via a channel to a central client task that can impose a global backoff logic. It would also need logic to reset the backoff after a certain period has passed since the last successful RPC call.
Hmm, fair point that the timeout could be increased at the client level. That said, some gRPCs inherently take much longer than others (e.g., `list_gridpool_orders`), which is why I was thinking of handling this at the gRPC level. However, I'm definitely not opposed to managing it at the client level either. For me, the most important thing is that we have a timeout in place (whether it's dynamically decided or fixed) to avoid hanging calls.
I see. True, different RPCs having different timeouts needs to be considered. But the main objective of timeouts would be to reduce stress on the client's connection, and a per-call timeout violates this principle: one call might be backing off, but then a new call might go straight through. So my guess is that while you may see some benefits of this approach right away, it would be a matter of chance when the connection to the server gets stressed. The more client apps there are, the further it will regress.
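For illustration, a minimal sketch of what such a client-level (global) backoff could look like. It uses a shared lock instead of the channel-and-central-task approach suggested above, and the name `GlobalBackoff` and its parameters are hypothetical, not part of this PR:

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


class GlobalBackoff:
    """Backoff state shared by all RPCs of one client instance."""

    def __init__(self, initial: float = 1.0, maximum: float = 60.0) -> None:
        self._initial = initial
        self._maximum = maximum
        self._delay = initial
        self._lock = asyncio.Lock()

    async def call(
        self, rpc: Callable[..., Awaitable[Any]], *args: Any, **kwargs: Any
    ) -> Any:
        # Serializing all calls through one lock makes every RPC respect the
        # same backoff state, so no call can "go straight through" while
        # another one is backing off.
        async with self._lock:
            while True:
                try:
                    result = await rpc(*args, **kwargs)
                except Exception:
                    await asyncio.sleep(self._delay)
                    self._delay = min(self._delay * 2, self._maximum)
                else:
                    self._delay = self._initial  # reset after a success
                    return result
```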
Force-pushed from ff2e5ff to a0eb107
I currently have a […]
tiyash-basu-frequenz left a comment:
The addition itself looks good. I have a few improvement suggestions.
Force-pushed from a0eb107 to cb447b8
tiyash-basu-frequenz left a comment:
Looks very good! I would just suggest adding a few tests to nail down stability and prevent regressions, and we're done.
Force-pushed from cb447b8 to 2c94c28
Force-pushed from 2c94c28 to 7797f8d
Sorry, I'm just realizing this has become a retry implementation for non-streaming calls. Retries of this sort have to be infinite, because we have to keep trying until the API call succeeds, and doing that in the client would take away the actor's agency in deciding what to do when there is a client error, like taking fallback action, sending a Slack message, etc. It also introduces the risk of creating orders with outdated power or price forecasts if the client's retry succeeds after a few minutes. For this reason, all of our clients should have retries only for streaming APIs (which are already taken care of by the streaming helper in the base client). For non-streaming APIs, retries can be implemented in the actors using the backoff module in the actor library.

Also, I've implemented these in the actor for now in this PR: https://github.com/frequenz-io/frequenz-actor-electricity-trading/pull/249

So, I think this PR should go back to just implementing a timeout for non-streaming methods, to prevent them from blocking for too long and limiting the actor's ability to take fallback action.
Then all that needs to be done is to invoke only the simple RPCs with the backoff and call the streaming ones directly. If a backoff module exists in the base client, then that could be used directly (and sorry that you implemented it here @camille-bouvy-frequenz). I also would not recommend going back to a non-global backoff, since I do not see any value in that.
Not in the client, for the reasons mentioned above. We just need the timeout to be implemented in this PR.
Top-level retry is provided by the actor implementation, which gives the most flexibility. There are also plans to add additional backoff mechanisms, failure detection based on run times, etc., and then it is all needed in only one place. If users need additional control, they can implement it in their actors.
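To illustrate that split of responsibilities, here is a minimal sketch of an actor-level retry around a client call that only enforces a timeout. The client method (`create_order`), the retry parameters, and the timeout value are hypothetical:

```python
import asyncio
from datetime import timedelta


async def create_order_with_retry(client, order, *, max_attempts: int = 5):
    """Retry a non-streaming RPC in the actor, where the fallback logic lives."""
    for attempt in range(max_attempts):
        try:
            # The client itself only enforces a timeout; it never retries.
            return await client.create_order(order, timeout=timedelta(seconds=30))
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the actor take fallback action
            await asyncio.sleep(2**attempt)  # simple exponential backoff
```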
Force-pushed from 32305d4 to f5cdbe7
Review comment on the following lines:

```python
async def grpc_call_with_timeout(
    call: Callable[..., Awaitable[Any]], *args: Any, timeout: float = 300, **kwargs: Any
```
I think this should take a `timedelta | None`, and the default value should be `None` here and in the other calls. Otherwise it wouldn't be intuitive, and it's not what you'd expect from a thin wrapper.
I agree about the `timedelta`, but I don't think the timeout should be `None` by default. Otherwise we risk having the API stuck again (without giving any errors), like we did last week.
Well, the actor would set it when making calls to the client. The actor can get the value to use from the config file.
The 300s comes from specific issues we noticed in the server and what we assume the actor needs. We can't say that 300s is the right value in the long term; it could change even as soon as we go to production.
No, the 300s was arbitrary, just to have some timeout on the API requests. And the issues didn't come from the actor's side but from the API side, when it disconnected (without throwing any errors). I don't mind changing the 300s to another value, but I'd feel more comfortable having a timeout by default.
> No, the 300s was arbitrary, just to have some timeout on the API requests.
This is just the thing. Defaults have to make sense, and we can't find a value that does, especially when we expect external parties to use it. Maybe we shouldn't add this here at all and keep it only in the actors? If, at some point, we get an SDK-like layer between the client and the actor, these problems would become much easier to solve.
Maybe @llucax has an opinion? Because we try to be consistent with all the API clients.
In the worst case, we go ahead with 300s as the default if it works for most cases.
For example, the default receiver buffer size of 50 is a bit arbitrary and has bitten me sometimes. It didn't occur to me for a long time that it was overflowing and needed a bigger buffer. But still, having 50 as a default has been very useful in most places. The situation here is very different, because `None` as a buffer size would have meant no buffer, whereas here `None` as a timeout means an infinite timeout.
Yeah, OK, I get your point. I'll set the default to `None` for now, then, and just make sure we call the functions with timeouts in the actors.
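For reference, a minimal sketch of what such a thin wrapper could look like with the agreed-upon signature, using `asyncio.wait_for` to enforce the timeout; this is an illustration under those assumptions, not the PR's actual code:

```python
import asyncio
from collections.abc import Awaitable, Callable
from datetime import timedelta
from typing import Any


async def grpc_call_with_timeout(
    call: Callable[..., Awaitable[Any]],
    *args: Any,
    timeout: timedelta | None = None,
    **kwargs: Any,
) -> Any:
    """Invoke a non-streaming RPC, cancelling it once the timeout elapses.

    With timeout=None the wrapper behaves exactly like the wrapped RPC,
    which is what you would expect from a thin wrapper.
    """
    if timeout is None:
        return await call(*args, **kwargs)
    return await asyncio.wait_for(
        call(*args, **kwargs), timeout=timeout.total_seconds()
    )
```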
So, if I understand it right, you are specifically suggesting having per-RPC backoffs in this client? If yes, how would that help? Nice that a global backoff module exists in the actor library!
No, I'm saying we shouldn't add automatic retry in the client for non-streaming RPCs. There is no question of backoff if there's no retry. I'm only suggesting that we have a timeout for such RPCs.
Signed-off-by: camille-bouvy-frequenz <[email protected]>
Force-pushed from f5cdbe7 to cbc6978
shsms left a comment:
Just an optional comment.
Signed-off-by: camille-bouvy-frequenz <[email protected]>
Force-pushed from cbc6978 to 87f01cd
This PR introduces a backoff module that handles function calls with a configurable timeout, which also increases with every timed-out call. It is then used for all non-streaming gRPC-related methods of the Client. The default timeout is set to 300 seconds (i.e., 5 minutes), with the flexibility to adjust it as needed.
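For illustration, a minimal sketch of how a timeout that grows after each timed-out call could be structured; the class name, growth factor, and cap are assumptions, not the module's actual code:

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


class GrowingTimeout:
    """A call timeout that is increased after every call that times out."""

    def __init__(
        self, initial: float = 300.0, factor: float = 2.0, maximum: float = 1800.0
    ) -> None:
        self._timeout = initial
        self._factor = factor
        self._maximum = maximum

    async def call(
        self, rpc: Callable[..., Awaitable[Any]], *args: Any, **kwargs: Any
    ) -> Any:
        try:
            return await asyncio.wait_for(rpc(*args, **kwargs), timeout=self._timeout)
        except asyncio.TimeoutError:
            # Give the next attempt more time before it fails again.
            self._timeout = min(self._timeout * self._factor, self._maximum)
            raise
```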